Assignment as part of Visual Analytics Project titled VISTAS - Visualising Industry Skill TAlent Shifts.
The LinkedIn and World Bank Group have partnered and released data from 2015 to 2019 that focuses on 100+ countries with at least 100,000 LinkedIn members each, distributed across 148 industries and 50,000 skill categories. This data aims to help government and researchers understand rapidly evolving labor markets with detailed and dynamic data (The World Bank Group, 2021).
The proposed datasets for analyses will mainly come from the LinkedIn-World Bank partnership. This will be complemented with GDP per capita from World Bank.
In this assignment, the following datasets from LinkedIn-World Bank will be used:
The GDP per capita dataset, which comprises annual time-series GDP per capita (current USD) for countries, is also used.
Simple data visualisations are publicly available at LinkedIn-World Bank webpage. However, they are limited in the exploration of trends because they only allow one parameter of choice.
In terms of how variables relate to one another, there are limitations to studies done so far. Studies on relationship between GDP per capita growth rate and population growth are usually done at country level, so more studies can be done to analyse the relationship between GDP per capita growth rate and employment growth or migration in each industry. Figure 1 shows the relationship between employment growth and interstate migration (Tunny, G., 2015). Here, analysis is done at state level, but not industry level.
Figure 2 shows the relationship between GDP per capita and share of employment in business services in 2000 for countries in Europe Kox, H. & Rubalcaba, L., 2007. However, this does not show how a change in migration or employment growth in an industry will cause the change in GDP per capita.
Overall, these data visualisations mentioned are static and do not allow readers to find out how the change in GDP per capita is influenced by the change in employment and migration in each industry or skill and how the change in employment in influenced by change in migration in each industry or skill.
Through this project, we will create a user-friendly, interactive dashboard that allows individuals and countries to view their competitive advantage and understand the evolving labour markets across the world in four areas: skills, occupations, migration and industries. Comparison will be allowed at both country-level and industry-level. The project will help answer the following questions:
For individuals:
For countries:
This assignment will build regression and correlation plots to answer questions 1A, 1B and 2E.
Figure 3 shows the sketch of the proposed data visualisation.
For this assignment, the proposed data visualisation is an interactive dashboard with regression and correlation plots. It can be filtered according to country, region, year, income level, industry section, industry, skill category and skill to show different plots. Users can choose two parameters and find out the relationship between GDP per capita growth and industry employment growth/industry migration/skill migration, or industry employment growth and industry/skill migration. Correlation of GDP per capita growth, industry employment growth, industry migration and skill migration can also be determined.
The following packages are used to build the respective plots:
Unlike R Shiny, we are unable to have an interactive filter on R Markdown that can filter the dataset and directly change the regression and correlation plots. Crosstalk, d3scatter and lazyeval packages allow interactive filter on R Markdown, but are only compatible with data tables and scatter plots. Thus, in this assignment, we will first use crosstalk, d3scatter and lazyeval packages to filter and explore the distributions of variables on scatter plots, before we manually adjust the code and filter the dataset, which will change the regression and correlation plots.
First, we will install and launch the relevant R packages.
If d3scatter package has not been installed before, we can run the following two commands.
packages = c('devtools')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
devtools::install_github("jcheng5/d3scatter")
If d3scatter package has already been installed, we can run the following command to install and launch all relevant R packages. In addition to the packages mentioned in the earlier sections, the readxl package is required to read xlsx files and the tidyverse package is needed to import data (read_csv), do data cleaning and manipulation (dplyr, tidyr). The olsrr package is also used to find out detailed results of the regression plots for further discussion in this assignment.
packages = c('readxl', 'devtools', 'crosstalk', 'd3scatter',
'lazyeval', 'tidyverse', 'ggstatsplot', 'olsrr')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
Next, we will load the necessary datasets for this assignment, which include GDP per capita, industry employment growth, industry migration, skill migration and industry skills need.
industryEmploymentGrowth <- read_excel("data/public_use-industry-employment-growth.xlsx",
sheet = "Growth from Industry Transition")
industrySkillsNeeds <- read_excel("data/public_use-industry-skills-needs.xlsx",
sheet = "Industry Skills Needs")
industryMigration <- read_excel("data/public_use-talent-migration.xlsx",
sheet = "Industry Migration")
skillMigration <- read_excel("data/public_use-talent-migration.xlsx",
sheet = "Skill Migration")
gdpPerCapita <- read_csv("data/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_2055804.csv")
After loading the datasets, we will do the necessary data cleaning and manipulation.
The GDP per capita dataset contains GDP per capita for various countries over many years. First, we change the column name from “Country Name” to “country_name” to be consistent with other datasets. We also change the column names e.g. from “2019” to “gpc2019”, as the numerical labelling may result in difficulties in data wrangling. Only relevant columns are kept i.e. country, GDP per capita from 2014 to 2019. Next, to determine the change in GDP per capita year on year for 2015 to 2019, we apply the formulas and keep only relevant columns i.e. country, GDP per capita growth for 2015 to 2019. Lastly, we use pivot_longer to place GDP per capita growth in rows instead of columns. The name of the column storing the original values of the columns is titled “GDP_per_capita_growth” and the name of the column storing the original names of the columns is titled “year”. We also edit to keep only values of the year i.e. from “gpc2019” to “2019”.
names(gdpPerCapita)[1] <- "country_name"
names(gdpPerCapita)[59] <- "gpc2014"
names(gdpPerCapita)[60] <- "gpc2015"
names(gdpPerCapita)[61] <- "gpc2016"
names(gdpPerCapita)[62] <- "gpc2017"
names(gdpPerCapita)[63] <- "gpc2018"
names(gdpPerCapita)[64] <- "gpc2019"
gdpPerCapita <- gdpPerCapita[-c(2:58, 65)]
gdpPerCapita$g2015 <- (gdpPerCapita$gpc2015/gdpPerCapita$gpc2014) - 1
gdpPerCapita$g2016 <- (gdpPerCapita$gpc2016/gdpPerCapita$gpc2015) - 1
gdpPerCapita$g2017 <- (gdpPerCapita$gpc2017/gdpPerCapita$gpc2016) - 1
gdpPerCapita$g2018 <- (gdpPerCapita$gpc2018/gdpPerCapita$gpc2017) - 1
gdpPerCapita$g2019 <- (gdpPerCapita$gpc2019/gdpPerCapita$gpc2018) - 1
gdpPerCapita <- gdpPerCapita[-c(2:7)] %>%
pivot_longer(cols = c(`g2015`, `g2016`, `g2017`, `g2018`, `g2019`),
names_to = "year",
values_to = "GDP_per_capita_growth") %>%
mutate(year = str_sub(year, 2, -1))
The industry employment growth dataset contains employment growth rate for various industries in various countries from 2015 to 2019. We will keep only relevant columns i.e. country, region, income, industry, employment growth for 2015 to 2019. Next, we use pivot_longer to place industry employment growth in rows instead of columns. The name of the column storing the original values of the columns is titled “employment_growth” and the name of the column storing the original names of the columns is titled “year”. We also edit to keep only values of the year i.e. from “growth_rate_2019” to “2019”.
The industry migration dataset contains migration data for various industries in various countries from 2015 to 2019. We will keep only relevant columns i.e. country, region, income, industry, migration for 2015 to 2019. Next, we use pivot_longer to place industry migration in rows instead of columns. The name of the column storing the original values of the columns is titled “industry_migration” and the name of the column storing the original names of the columns is titled “year”. We also edit to keep only values of the year i.e. from “net_per_10k_2019” to “2019”. Lastly, we will normalise industry migration by dividing by 1000 i.e. industry migration (net per 10) to have a common scale with GDP per capita growth and industry employment growth.
industryMigration <- industryMigration[-c(1, 5, 7)] %>%
pivot_longer(cols = c(`net_per_10K_2015`, `net_per_10K_2016`, `net_per_10K_2017`,
`net_per_10K_2018`, `net_per_10K_2019`),
names_to = "year",
values_to = "industry_migration") %>%
mutate(year = str_sub(year, 13, -1)) %>%
mutate(industry_migration = industry_migration / 1000)
The skill migration dataset contains migration data for various skills in various countries from 2015 to 2019. We will keep only relevant columns i.e. country, region, income, skill, skill migration for 2015 to 2019. Next, we use pivot_longer to place skill migration in rows instead of columns. The name of the column storing the original values of the columns is titled “skill_migration” and the name of the column storing the original names of the columns is titled “year”. We also edit to keep only values of the year i.e. from “net_per_10k_2019” to “2019”. Lastly, we will normalise skill migration by dividing by 1000 i.e. skill migration (net per 10) to have a common scale with GDP per capita growth and industry employment growth.
skillMigration <- skillMigration[-c(1, 5)] %>%
pivot_longer(cols = c(`net_per_10K_2015`, `net_per_10K_2016`, `net_per_10K_2017`,
`net_per_10K_2018`, `net_per_10K_2019`),
names_to = "year",
values_to = "skill_migration") %>%
mutate(year = str_sub(year, 13, -1)) %>%
mutate(skill_migration = skill_migration / 1000)
The industry skills need dataset contains the top 10 skill needs for each industry from 2015 to 2019. We will merge skill migration and industry skill need datasets to form the industrySkillMigration data table and show the industries which the skills are needed in.
Various datasets are then merged to have various visualisations of regression and correlation. They are as follows.
master1 <- merge(industryEmploymentGrowth, gdpPerCapita,
by = c("country_name", "year"))
master2 <- merge(industryMigration, gdpPerCapita,
by = c("country_name", "year"))
master3 <- merge(skillMigration, gdpPerCapita,
by = c("country_name", "year"))
master4 <- merge(industryEmploymentGrowth, industryMigration,
by = c("country_name", "year", "wb_region", "wb_income",
"isic_section_name", "industry_name"))
master5 <- merge(industryEmploymentGrowth, industrySkillMigration,
by = c("country_name", "year", "wb_region", "wb_income",
"isic_section_name", "industry_name"))
master6 <- merge(master5, master2,
by = c("country_name", "year", "wb_region", "wb_income",
"isic_section_name", "industry_name"))
For simplicity in this assignment, we will use master6, even though there may be less observations when the data tables are merged.
In this section, we will use crosstalk and d3scatter packages to create interactive scatter plots. They are filtered by country, region, year, income level, industry section, industry, skill category and skill. By default, all options are chosen and multiple options can selected for the filters. They are also coloured by year to show the differences over the years. The limitations are the inability to adjust the orientation of the axes labels and to place the reference lines at zero for aesthetic purpose. Nonetheless, the interactive scatter plots are more for exploratory purpose for this assignment, so we won’t adjust the aesthetics of the scatter plots further. In addition, due to the long processing time of the interactive scatter plot and resulting large size of the R Markdown file, we will only run the code of one of the interactive scatter plot (Employment growth vs Industry migration). The remaining plots will be illustrated via image format.
The following scatter plots are plotted:
shared_master <- SharedData$new(master6)
bscols(widths = c(3,NA),
list(
filter_select("country_name", "Country",
shared_master, ~country_name,
allLevels = TRUE, multiple = TRUE),
filter_select("year", "Year",
shared_master, ~year,
allLevels = TRUE, multiple = TRUE),
filter_select("wb_region", "Region",
shared_master, ~wb_region,
allLevels = TRUE, multiple = TRUE),
filter_select("wb_income", "Income Level",
shared_master, ~wb_income,
allLevels = TRUE, multiple = TRUE),
filter_select("isic_section_name", "Industry Section",
shared_master, ~isic_section_name,
allLevels = TRUE, multiple = TRUE),
filter_select("industry_name", "Industry",
shared_master, ~industry_name,
allLevels = TRUE, multiple = TRUE),
filter_select("skill_group_category", "Skill Category",
shared_master, ~skill_group_category,
allLevels = TRUE, multiple = TRUE),
filter_select("skill_group_name", "Skill",
shared_master, ~skill_group_name,
allLevels = TRUE, multiple = TRUE)
),
d3scatter(shared_master, ~employment_growth, ~GDP_per_capita_growth, ~year,
x_label = "Employment Growth",
y_label = "GDP Per Capita Growth",
width = "100%", height = 500)
)
bscols(widths = c(3,NA),
list(
filter_select("country_name", "Country",
shared_master, ~country_name,
allLevels = TRUE, multiple = TRUE),
filter_select("year", "Year",
shared_master, ~year,
allLevels = TRUE, multiple = TRUE),
filter_select("wb_region", "Region",
shared_master, ~wb_region,
allLevels = TRUE, multiple = TRUE),
filter_select("wb_income", "Income Level",
shared_master, ~wb_income,
allLevels = TRUE, multiple = TRUE),
filter_select("isic_section_name", "Industry Section",
shared_master, ~isic_section_name,
allLevels = TRUE, multiple = TRUE),
filter_select("industry_name", "Industry",
shared_master, ~industry_name,
allLevels = TRUE, multiple = TRUE),
filter_select("skill_group_category", "Skill Category",
shared_master, ~skill_group_category,
allLevels = TRUE, multiple = TRUE),
filter_select("skill_group_name", "Skill",
shared_master, ~skill_group_name,
allLevels = TRUE, multiple = TRUE)
),
d3scatter(shared_master, ~industry_migration, ~GDP_per_capita_growth, ~year,
x_label = "Industry Migration (Net Per 10)",
y_label = "GDP Per Capita Growth",
width = "100%", height = 500)
)
bscols(widths = c(3,NA),
list(
filter_select("country_name", "Country",
shared_master, ~country_name,
allLevels = TRUE, multiple = TRUE),
filter_select("year", "Year",
shared_master, ~year,
allLevels = TRUE, multiple = TRUE),
filter_select("wb_region", "Region",
shared_master, ~wb_region,
allLevels = TRUE, multiple = TRUE),
filter_select("wb_income", "Income Level",
shared_master, ~wb_income,
allLevels = TRUE, multiple = TRUE),
filter_select("isic_section_name", "Industry Section",
shared_master, ~isic_section_name,
allLevels = TRUE, multiple = TRUE),
filter_select("industry_name", "Industry",
shared_master, ~industry_name,
allLevels = TRUE, multiple = TRUE),
filter_select("skill_group_category", "Skill Category",
shared_master, ~skill_group_category,
allLevels = TRUE, multiple = TRUE),
filter_select("skill_group_name", "Skill",
shared_master, ~skill_group_name,
allLevels = TRUE, multiple = TRUE)
),
d3scatter(shared_master, ~skill_migration, ~GDP_per_capita_growth, ~year,
x_label = "Skill Migration (Net Per 10)",
y_label = "GDP Per Capita Growth",
width = "100%", height = 500)
)
bscols(widths = c(3,NA),
list(
filter_select("country_name", "Country",
shared_master, ~country_name,
allLevels = TRUE, multiple = TRUE),
filter_select("year", "Year",
shared_master, ~year,
allLevels = TRUE, multiple = TRUE),
filter_select("wb_region", "Region",
shared_master, ~wb_region,
allLevels = TRUE, multiple = TRUE),
filter_select("wb_income", "Income Level",
shared_master, ~wb_income,
allLevels = TRUE, multiple = TRUE),
filter_select("isic_section_name", "Industry Section",
shared_master, ~isic_section_name,
allLevels = TRUE, multiple = TRUE),
filter_select("industry_name", "Industry",
shared_master, ~industry_name,
allLevels = TRUE, multiple = TRUE),
filter_select("skill_group_category", "Skill Category",
shared_master, ~skill_group_category,
allLevels = TRUE, multiple = TRUE),
filter_select("skill_group_name", "Skill",
shared_master, ~skill_group_name,
allLevels = TRUE, multiple = TRUE)
),
d3scatter(shared_master, ~industry_migration, ~employment_growth, ~year,
x_label = "Industry Migration (Net Per 10)",
y_label = "Employment Growth",
width = "100%", height = 500)
)
bscols(widths = c(3,NA),
list(
filter_select("country_name", "Country",
shared_master, ~country_name,
allLevels = TRUE, multiple = TRUE),
filter_select("year", "Year",
shared_master, ~year,
allLevels = TRUE, multiple = TRUE),
filter_select("wb_region", "Region",
shared_master, ~wb_region,
allLevels = TRUE, multiple = TRUE),
filter_select("wb_income", "Income Level",
shared_master, ~wb_income,
allLevels = TRUE, multiple = TRUE),
filter_select("isic_section_name", "Industry Section",
shared_master, ~isic_section_name,
allLevels = TRUE, multiple = TRUE),
filter_select("industry_name", "Industry",
shared_master, ~industry_name,
allLevels = TRUE, multiple = TRUE),
filter_select("skill_group_category", "Skill Category",
shared_master, ~skill_group_category,
allLevels = TRUE, multiple = TRUE),
filter_select("skill_group_name", "Skill",
shared_master, ~skill_group_name,
allLevels = TRUE, multiple = TRUE)
),
d3scatter(shared_master, ~skill_migration, ~employment_growth, ~year,
x_label = "Skill Migration (Net Per 10)",
y_label = "Employment Growth",
width = "100%", height = 500)
)
In this assignment, we will focus on 2019 dataset for information and communication industry or tech skills, given the relevance to SMU MITB students. By applying the filters i.e. year (2019), industry section (information and communication) and skill category (tech skills), the above interactive scatter plots will produce Figures 4-8.
Next, we will edit the code to filter master6. Two filtered data tables are created, one filtered by year 2019 and information and communication industry and the other filtered by year 2019 and tech skills. These will be used to find specific results under the regression and correlation plots. With the results, countries can find out how GDP per capita growth is impacted by changes in employment and migration for information and communication industry and tech skill. Users can also find out the relationship between employment growth and migration for information and communication industry and tech skill to find potential job opportunities.
Using the filtered data tables, we will create the respective regression plots as seen below. ggstatsplot package is used instead of ggplot package, as it allows histograms to be plotted (marginal distributions) and we can analyse the distributions of both variables in the regression plots.
For the regression plots, the title, axes (xlab, ylab) and source (caption) are labelled appropriately. The y axis label is rotated to improve readability. The plots are coloured by year. Note that since the filtered data tables contain data only from year 2019, there will only be one colour seen here. Reference lines are drawn at zero to allow users to distinguish between positive and negative easily. The default 95% confidence interval is used and the results of the statistical tests are displayed. From the plots, we note that the distributions of all the variables follow a normal distribution. We will further discuss the results in the later section.
In addition, we will use the olsrr package to show the results of each regression plot, as it’s easier to understand and interpret.
ggscatterstats(filtered_master1, employment_growth, GDP_per_capita_growth,
title = "Relationship Between GDP Per Capita Growth & Employment Growth",
caption = "Source: LinkedIn & World Bank Group",
ggplot.component = list(ggplot2::
aes(employment_growth, GDP_per_capita_growth,
colour = year),
xlab("Employment Growth"),
ylab("GDP Per\nCapita\nGrowth"),
theme(axis.title.y = element_text(angle = 0)),
geom_hline(yintercept = 0,
linetype = "dashed",
color = "grey60",
size = 1),
geom_vline(xintercept = 0,
linetype = "dashed",
color = "grey60",
size = 1)))

ols_regress(GDP_per_capita_growth ~ employment_growth, filtered_master1)
Model Summary
-----------------------------------------------------------------
R 0.067 RMSE 0.052
R-Squared 0.004 Coef. Var -1103.029
Adj. R-Squared 0.004 MSE 0.003
Pred R-Squared 0.004 MAE 0.036
-----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 0.075 1 0.075 27.976 0.0000
Residual 16.824 6267 0.003
Total 16.899 6268
---------------------------------------------------------------------
Parameter Estimates
----------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
----------------------------------------------------------------------------------------------
(Intercept) -0.004 0.001 -6.351 0.000 -0.005 -0.003
employment_growth -0.144 0.027 -0.067 -5.289 0.000 -0.197 -0.091
----------------------------------------------------------------------------------------------
ggscatterstats(filtered_master1, industry_migration, GDP_per_capita_growth,
title = "Relationship Between GDP Per Capita Growth & Industry Migration",
caption = "Source: LinkedIn & World Bank Group",
ggplot.component = list(ggplot2::
aes(industry_migration, GDP_per_capita_growth,
colour = year),
xlab("Industry Migration (Net Per 10)"),
ylab("GDP Per\nCapita\nGrowth"),
theme(axis.title.y = element_text(angle = 0)),
geom_hline(yintercept = 0,
linetype = "dashed",
color = "grey60",
size = 1),
geom_vline(xintercept = 0,
linetype = "dashed",
color = "grey60",
size = 1)))

ols_regress(GDP_per_capita_growth ~ industry_migration, filtered_master1)
Model Summary
-----------------------------------------------------------------
R 0.108 RMSE 0.052
R-Squared 0.012 Coef. Var -1099.075
Adj. R-Squared 0.011 MSE 0.003
Pred R-Squared 0.011 MAE 0.035
-----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 0.196 1 0.196 73.357 0.0000
Residual 16.704 6267 0.003
Total 16.899 6268
---------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------
(Intercept) -0.004 0.001 -6.049 0.000 -0.005 -0.003
industry_migration -0.030 0.003 -0.108 -8.565 0.000 -0.036 -0.023
-----------------------------------------------------------------------------------------------
ggscatterstats(filtered_master2, skill_migration, GDP_per_capita_growth,
title = "Relationship Between GDP Per Capita Growth & Skill Migration",
caption = "Source: LinkedIn & World Bank Group",
ggplot.component = list(ggplot2::
aes(skill_migration, GDP_per_capita_growth,
colour = year),
xlab("Skill Migration (Net Per 10)"),
ylab("GDP Per\nCapita\nGrowth"),
theme(axis.title.y = element_text(angle = 0)),
geom_hline(yintercept = 0,
linetype = "dashed",
color = "grey60",
size = 1),
geom_vline(xintercept = 0,
linetype = "dashed",
color = "grey60",
size = 1)))

ols_regress(GDP_per_capita_growth ~ skill_migration, filtered_master2)
Model Summary
----------------------------------------------------------------
R 0.129 RMSE 0.050
R-Squared 0.017 Coef. Var -810.651
Adj. R-Squared 0.017 MSE 0.003
Pred R-Squared 0.016 MAE 0.034
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 0.237 1 0.237 93.927 0.0000
Residual 13.971 5537 0.003
Total 14.208 5538
---------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
--------------------------------------------------------------------------------------------
(Intercept) -0.005 0.001 -7.341 0.000 -0.006 -0.004
skill_migration -0.037 0.004 -0.129 -9.692 0.000 -0.044 -0.029
--------------------------------------------------------------------------------------------
ggscatterstats(filtered_master1, industry_migration, employment_growth,
title = "Relationship Between Employment Growth & Industry Migration",
caption = "Source: LinkedIn & World Bank Group",
ggplot.component = list(ggplot2::
aes(industry_migration, employment_growth,
colour = year),
xlab("Industry Migration (Net Per 10)"),
ylab("Employment\nGrowth"),
theme(axis.title.y = element_text(angle = 0)),
geom_hline(yintercept = 0,
linetype = "dashed",
color = "grey60",
size = 1),
geom_vline(xintercept = 0,
linetype = "dashed",
color = "grey60",
size = 1)))

ols_regress(employment_growth ~ industry_migration, filtered_master1)
Model Summary
---------------------------------------------------------------
R 0.655 RMSE 0.019
R-Squared 0.429 Coef. Var 733.385
Adj. R-Squared 0.428 MSE 0.000
Pred R-Squared 0.428 MAE 0.014
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
-----------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-----------------------------------------------------------------------
Regression 1.680 1 1.680 4803.486 0.0000
Residual 2.240 6406 0.000
Total 3.920 6407
-----------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
--------------------------------------------------------------------------------------------
(Intercept) 0.002 0.000 7.822 0.000 0.001 0.002
industry_migration 0.074 0.001 0.655 69.307 0.000 0.072 0.076
--------------------------------------------------------------------------------------------
ggscatterstats(filtered_master2, skill_migration, employment_growth,
title = "Relationship Between Employment Growth & Skill Migration",
caption = "Source: LinkedIn & World Bank Group",
ggplot.component = list(ggplot2::
aes(skill_migration, employment_growth,
colour = year),
xlab("Skill Migration (Net Per 10)"),
ylab("Employment\nGrowth"),
theme(axis.title.y = element_text(angle = 0)),
geom_hline(yintercept = 0,
linetype = "dashed",
color = "grey60",
size = 1),
geom_vline(xintercept = 0,
linetype = "dashed",
color = "grey60",
size = 1)))

ols_regress(employment_growth ~ skill_migration, filtered_master2)
Model Summary
---------------------------------------------------------------
R 0.498 RMSE 0.019
R-Squared 0.248 Coef. Var 632.051
Adj. R-Squared 0.248 MSE 0.000
Pred R-Squared 0.248 MAE 0.013
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
-----------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-----------------------------------------------------------------------
Regression 0.647 1 0.647 1866.413 0.0000
Residual 1.958 5648 0.000
Total 2.604 5649
-----------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------
(Intercept) 0.002 0.000 8.863 0.000 0.002 0.003
skill_migration 0.048 0.001 0.498 43.202 0.000 0.046 0.050
-----------------------------------------------------------------------------------------
Next, we will find out the correlation of the variables. Given that there are very few variables (only 4), we will use ggcorrmat (under ggstatsplot package) instead of corrplot and other packages, as it will produce a simple correlation plot. We will use the default properties i.e. upper triangular matrix and significance level of 0.05. Both filtered data tables are used for the correlation plots.
It is noted that GDP per capita growth is weakly correlated with employment growth, industry and skill migration. Industry and skill migration are highly correlated (>0.8) for both plots. The results will be further discussed in the later section.
ggcorrmat(filtered_master1,
cor.vars = c(7, 10, 11, 12))

ggcorrmat(filtered_master2,
cor.vars = c(7, 10, 11, 12))

In addition to the simple regression plots, we will build a multiple regression model in this assignment using the olsrr package to study the relationship between GDP per capita growth and multiple variables i.e. employment growth, industry and skill migration. 3 multiple regression models are created, one with employment growth, industry and skill migration as dependent variables and we will remove industry migration or skill migration for the other two, as they are highly correlated, as seen in the earlier correlation plots.
All the regression models show weak relationship (low R-squared values) between GDP per capita growth and the dependent variables. We will further discuss the results in the later section.
ols_regress(GDP_per_capita_growth ~ employment_growth + industry_migration
+ skill_migration,
filtered_master1)
Model Summary
-----------------------------------------------------------------
R 0.118 RMSE 0.052
R-Squared 0.014 Coef. Var -1097.910
Adj. R-Squared 0.014 MSE 0.003
Pred R-Squared 0.013 MAE 0.035
-----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 0.236 3 0.079 29.608 0.0000
Residual 16.663 6265 0.003
Total 16.899 6268
---------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------
(Intercept) -0.004 0.001 -6.108 0.000 -0.005 -0.003
employment_growth 0.000 0.035 0.000 0.013 0.990 -0.068 0.069
industry_migration -0.012 0.006 -0.043 -1.844 0.065 -0.024 0.001
skill_migration -0.022 0.006 -0.081 -3.911 0.000 -0.033 -0.011
-----------------------------------------------------------------------------------------------
ols_regress(GDP_per_capita_growth ~ employment_growth + industry_migration,
filtered_master1)
Model Summary
-----------------------------------------------------------------
R 0.108 RMSE 0.052
R-Squared 0.012 Coef. Var -1099.162
Adj. R-Squared 0.011 MSE 0.003
Pred R-Squared 0.011 MAE 0.035
-----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 0.196 2 0.098 36.678 0.0000
Residual 16.704 6266 0.003
Total 16.899 6268
---------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------
(Intercept) -0.004 0.001 -6.038 0.000 -0.005 -0.003
employment_growth 0.004 0.035 0.002 0.105 0.917 -0.065 0.072
industry_migration -0.030 0.004 -0.109 -6.722 0.000 -0.039 -0.021
-----------------------------------------------------------------------------------------------
ols_regress(GDP_per_capita_growth ~ employment_growth + skill_migration,
filtered_master1)
Model Summary
-----------------------------------------------------------------
R 0.116 RMSE 0.052
R-Squared 0.013 Coef. Var -1098.120
Adj. R-Squared 0.013 MSE 0.003
Pred R-Squared 0.012 MAE 0.035
-----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 0.227 2 0.114 42.695 0.0000
Residual 16.672 6266 0.003
Total 16.899 6268
---------------------------------------------------------------------
Parameter Estimates
----------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
----------------------------------------------------------------------------------------------
(Intercept) -0.004 0.001 -6.197 0.000 -0.005 -0.003
employment_growth -0.029 0.031 -0.013 -0.922 0.356 -0.090 0.032
skill_migration -0.030 0.004 -0.109 -7.561 0.000 -0.037 -0.022
----------------------------------------------------------------------------------------------
Finally, we also carry out stepwise regression to remove variable(s) that is not important to the regression model, based on the p value. The olsrr package (ols_step_backward_aic) is used to carry out backward stepwise regression.
ols_step_backward_aic(lm(GDP_per_capita_growth ~ employment_growth + industry_migration
+ skill_migration, filtered_master1))
Backward Elimination Summary
---------------------------------------------------------------------------
Variable AIC RSS Sum Sq R-Sq Adj. R-Sq
---------------------------------------------------------------------------
Full Model -19375.641 16.663 0.236 0.01398 0.01351
employment_growth -19377.640 16.663 0.236 0.01398 0.01366
---------------------------------------------------------------------------
As seen in the backward stepwise regression result, employment growth is removed. We will build a regression model without employment growth to explore the results. However, the regression model still shows a weak relationship (low R-squared values) between GDP per capita growth and the dependent variables. We will further discuss the results in the later section.
ols_regress(GDP_per_capita_growth ~ industry_migration + skill_migration, filtered_master1)
Model Summary
-----------------------------------------------------------------
R 0.118 RMSE 0.052
R-Squared 0.014 Coef. Var -1097.822
Adj. R-Squared 0.014 MSE 0.003
Pred R-Squared 0.013 MAE 0.035
-----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 0.236 2 0.118 44.419 0.0000
Residual 16.663 6266 0.003
Total 16.899 6268
---------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------
(Intercept) -0.004 0.001 -6.126 0.000 -0.005 -0.003
industry_migration -0.012 0.006 -0.043 -2.062 0.039 -0.023 -0.001
skill_migration -0.022 0.006 -0.081 -3.913 0.000 -0.033 -0.011
-----------------------------------------------------------------------------------------------
Subsequently, we will combine the interactive filters, regression plot and correlation plot on R Shiny app to produce an interactive, user-friendly dashboard. Figure 9 shows the final dashboard or data visualisation.
Finally, we discuss the results for the filtered dataset, which is for year 2019, information and communication industry and tech skills.